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^- ' Maximum likelihood estimation of a log-concave probability den- 

O ■ sity is formulated as a convex optimization problem and shown to 

|/' i ■ have an equivalent dual formulation as a constrained maximum Shan- 

non entropy problem. Closely related maximum Renyi entropy esti- 
mators that impose weaker concavity restrictions on the fitted density 
are also considered, notably a minimum Hellinger discrepancy estima- 
tor that constrains the reciprocal of the square-root of the density to 
be concave. A limiting form of these estimators constrains solutions 
to the class of quasi-concave densities. 

C$ • 1. Introduction. Our objective is to introduce a general class of shape 

constraints applicable to the estimation of probability densities, multivari- 
ate as well as univariate. Elements of the class are represented by restricting 

^S) . certain monotone functions of the density to lie in convex cones. Maximum 

likelihood estimation of log-concave densities constitutes an important spe- 
cial case; however, the wider class allows us to include a variety of other 

q | shapes. A one parameter subclass modeled on the means of order p studied 

by Hardy, Littlewood and Polya (1934) incorporates all the quasi-concave 

^. . densities, that is, all densities with convex upper contour sets. Estimation 

methods for these densities, as described below, bring new opportunities for 
statistical data analysis. 

Log-concave densities play a crucial role in a wide variety of probabilistic 
models: in reliability theory, search models, social choice and a broad range 
of other contexts it has proven convenient to assume densities whose loga- 



rithm is concave. Recognition of the importance of log-concavity was already 
apparent in the work of Schoenberg and Karlin on total positivity beginning 
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in the late 1940s. Karlin (1968) forged a link between log-concavity and 
classical statistical properties such as the monotone likelihood ratio prop- 
erty, the theory of sufficient statistics and uniformly most powerful tests. 
Maximum likelihood estimation of densities constrained to be log-concave 
has recently enjoyed a considerable vogue with important contributions of 
Walther (2001, 2002, 2009), Pal, Woodroofe and Meyer (2007), Rufibach 
(2007), Diimbgen and Rufibach (2009), Balabdaoui, Rufibach and Wellner 
(2009), Chang and Walther (2007) and Cule, Samworth and Stewart (2010), 
among others. 

Log-concave densities are constrained to exhibit exponential tail behav- 
ior. This restriction motivates a search for weaker forms of the concavity 
constraint capable of admitting common densities with algebraic tails like 
the t and F families. The p-concave densities introduced in Section 2 con- 
stitute a rich source of candidates. While it would be possible, in principle, 
to consider maximum likelihood estimation of such densities, duality con- 
siderations lead us to consider a more general class of maximum entropy 
criteria. Maximizing Shannon entropy in the dual is equivalent to maximum 
likelihood for the leading log-concave case, but other entropies are also of 
interest. Section 3 describes several examples arising in the dual from the 
class of Renyi entropies, each corresponding to a distinct specification of the 
concavity constraint, and each corresponding to a distinct fidelity criterion 
in the primal. The crucial advantage of adapting the fidelity criterion to 
the form of the concavity constraint is that it assures a convex optimization 
problem with a tractable computational strategy. 

2. Quasi-concave probability densities and their estimation. A probabil- 
ity density function, /, is called log-concave if —log/ is a (proper) convex 
function on the support of /. We adhere to the usual conventions of Rock- 
afellar (1970), which allow convex functions to take infinite values — although 
we will allow only +oo, because all our convex functions will be proper. The 
domain of a convex (concave) function, dom g, is then the set of x such that 
g(x) is finite. We adopt the convention — logO = +oo. 

Unimodality of concave functions implies that log-concave densities are 
unimodal. An interesting connection in the multivariate case was pointed 
out by Silverman (1981): the number of modes of a kernel density estimate 
is monotone in the bandwidth when the kernel is log-concave. However, 
as illustrated by the Student t family, not every unimodal density is log- 
concave. Laplace densities, with their exponential tail behavior, are; but 
heavier, algebraic tails are ruled out. This prohibition motivates a relaxation 
of the log-concavity requirement. 

2.1. A hierarchy of p-concave functions. A natural hierarchy of concave 
functions can be built on the foundation of the weighted means of order 
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p studied by Hardy, Littlewood and Polya (1934): for any p in the unit 
simplex, S = {p £ R n \p > 0, £>» = 1}, let 



M p (a;p) =M p (ai,...,a n ;p)= I ^Pitf 
for p t^ 0; the limiting case for p = is 

M (a;p) = M p (a 1 ,...,a n ;p) = J 



1/p 



af 



The familiar arithmetic, geometric and harmonic means correspond to p 
equal to 1, and — 1, respectively. Following Avriel (1972), a nonnegative, 
real function /, defined on a convex set C C M. d is called p-concave if for any 
xq,x± £ C and p £ S, 

fipoxo+pixi) > M p (f(x ),f(x 1 );p). 

In this terminology, log-concave functions are 0-concave and concave func- 
tions are 1-concave. As M p (a,p) is monotone, increasing in p for a > and 
any p £ 5, it follows that if / is p-concave, then / is also p'-concave for any 
p' < p. Thus, concave functions are log-concave, but not vice-versa. In the 
limit — oo, concave functions satisfy the condition 

fipoxo+pixi) >mm{f(x )J(x 1 )}, 

so they are (and consequently for all p-concave functions) quasi- concave. 

The hierarchy of p-concave density functions was considered in the eco- 
nomics literature by Caplin and Nalebuff (1991) in spatial models of voting 
and imperfect competition; their results reveal some intriguing connections 
to Tukey's half-space depth in multivariate statistics; see Mizera (2002). Cu- 
riously, it appears that the first thorough investigation of the mathematical 
concept of quasi-concavity was carried out by de Finetti (1949). Further de- 
tails and motivation for p-concave densities can be found in Prekopa (1973), 
Borell (1975) and Dharmadhikari and Joag-Dev (1988). 

2.2. Maximum likelihood estimation of log-concave densities. Suppose 
that X = {Xi, . . . ,X n } is a collection of data points in R such that the 
convex hull of X, H(X), has a nonempty interior in W d ; such a configura- 
tion occurs with probability 1 if n > d and the Xi behave like a random 
sample from /o, a probability density with respect to the Lebesgue measure 
on Mr. Viewing the AVs as a random sample from an unknown, log-concave 
density /o, we can find the maximum likelihood estimate of /o by solving 

v. 
(2.1) I f(Xi) = max! such that / is a log-concave density. 
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It is convenient to recast (2.1) in terms of g = —log /, the estimate becoming 

n „ 

(2.2) y^g(Xi) = min! such that g is convex and / e~ 9 ^ x ' dx = 1. 

The objective function of (2.2) is equal to +oo, given the convention adopted 
above, unless all Xi are in the domain of g. As in Silverman (1982), it proves 
convenient to move the integral constraint into the objective function, 

1 n f 

(2.3) — } g(Xi) + / e~ 9 ^ x > dx = min! such that g is convex, 
n^ J 9 

a device that ensures that the solution integrates to one without enforcing 
this condition explicitly. Apart from the multiplier 1/ra, the crucial difference 
between (2.2) and (2.3) is that the latter is a convex problem, while the 
former not. 

It is well known that naive, unrestricted maximum likelihood estimation 
is doomed to fail when applied in the general density estimation context: 
once "log-concave" is dropped from the formulation of (2.1), any sequence 
of putative maximizers is attracted to the the linear combination of point 
masses situated at the data points. One escape from this "Dirac catastrophe" 
involves regularization by introducing a roughness, or complexity, penalty; 
various proposals in this vein can be found in Good (1971), Silverman (1982), 
Gu (2002) and Koenker and Mizera (2008). 

Another way to obtain a well-posed problem is by imposing shape con- 
straints, a line of development dating back to the celebrated Grenander 
(1956) nonparametric maximum likelihood estimator for monotone den- 
sities. While monotonicity regularizes the maximum likelihood estimator, 
unimodality per se — somewhat surprisingly — does not. The desired effect is 
achieved only by enforcing somewhat more stringent shape constraints — for 
instance log-concavity, sometimes also called "strong unimodality." An ad- 
vantage of shape constraints over regularization based on norm penalties is 
that it is not encumbered by the need to select additional tuning param- 
eters; on the other hand, it is limited in scope — applicable only when the 
shape constraint is plausible for the unknown density. 

2.3. Quasi-concave density estimation. Expanding the scope of our in- 
vestigation, we now replace e~ g in the integral of the objective function by 
a generic function tp(g) and define 

n 

(2.4) $( 5 ) = -YV*i)+ U{g{x))dx. 

n U J 

The following conditions on the form of ip will be imposed: 
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(Al) ip is a nonincreasing, proper convex function on R. 

(A2) The domain of ip is an open interval containing (0, +00). 

(A3) The limit, as r — > +00, of ip{y + tx)/t is +00 for every real y and any 

x<0. 
(A4) ip is differentiable on the interior of its domain. 
(A5) ip is bounded from below by 0, with ip{x) — > when x — > +00. 

The most crucial condition is (Al) ensuring the convexity of <3?. Condition 
(A2) assures that ip{x) is finite for all x > 0, while (A3) is required in the 
proof of the existence of the estimates. The relationship between primal 
and dual formulations of the estimation problem is facilitated by ( A4) , and 
(A5) rules out possible complications regarding the existence of the integral 
J ' ip{g) dx in (2.4), allowing for the convention ^(-l-oo) = 0. In the spirit of the 
Lebesgue integration theory, the integral then exists, although ip{g) does not 
have to be summable: it is either finite [which is automatically true for any 
g convex and ip(g) = e~ 9 ] or +00. In the latter case, the objective function 
<&(g) is considered to be equal to +00; $(<?) is also +00 if g(Xi) = +00 for 
some Xi, which occurs unless all Aj lie in the domain of g. On the other 
hand, any g equal to some positive constant on T~L{X) and +00 elsewhere 
yields &(g) < 00. 

A rigorous treatment without assumption (A5), that is, for functions ijj 
not bounded below, would introduce technicalities involving handling of the 
integrals in the spirit of singular integrals of calculus of variations, a strategy 
resembling the contrivance of Huber (1967) of subtracting a fixed quantity 
from the objective function to ensure finiteness of the integral. Although we 
do not believe that such formal complications are unsurmountable, we do 
not pursue such a development. 

Careful deliberation reveals that replacing g by its closure (lower semi- 
continuous hull) does not change the integral term in (2.4), and potentially 
only decreases the first term; this means that without any restriction of its 
scope, we may reformulate the estimation problem as 

1 n f 

(2.5) — y g(Xi) + / ip(g(x)) dx = min! subject to g G /C, 

n f-f J 9 

where /C stands for the class of closed (lower semicontinuous) convex func- 
tions on R . 

Unlike in (2.3), ip{g) is not necessarily the estimated density /; the rela- 
tionship of g to / will be revealed in Section 3, together with the motivation 
leading to concrete instances of some possible functions ip. 

2.4. Characterization of estimates. We now establish that the estimates, 
the solutions of (2.5), admit a finite-dimensional characterization, which is 
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a key to many of their theoretical properties. For every collection (X, Y) of 
points Xi £ W 1 and Yi G K, we define a function 

{n n n ~\ 

t=l i=l i=l J 

Any function of this type is finitely generated in the sense of Rockafellar 
(1970), whose Corollary 19.1.2 asserts that it is polyhedral, being the maxi- 
mum of finitely many affine functions, and therefore convex. The convention 
inf = +oo used in (2.6) means that the domain of gtx,Y) is equal to T-L(X). 
If h is a convex function such that h(Xi) < Yi, for all i, then h(x) < grx,Y)( x ) 
for all x; the function g<x,Y) is thus the maximum of convex functions with 
this property — the lower convex hull of points (Xi, Yi). 

For fixed X, we will denote the collection of all functions gixY) OI " the form 
(2.6) by Q{X). The collection (X,Y) determines gix,Y) uniquely, by virtue 
of its definition (2.6). Given X, we call a vector Y with components Yi S M. 
discretely convex relative to X, if there exists a convex function h defined 
on Ti.(X) such that h(Xi) = Yi. Any function g from Q(X) determines a 
unique discretely convex vector Yi = g(Xi). The converse is also true: there 
is a one-one correspondence between Q(X) and T>(X) C W 1 , the set of all 
vectors discretely convex relative to X. 

Theorem 2.1. Suppose that assumption (Al) holds true. For every con- 
vex function h on M, d , there is a function g £ G(X) such that <£(<?) < &(h); 
the strict inequality holds whenever h ^ Q(X) and %{X) has nonempty in- 
terior. 

The theorem shows that it is sufficient to seek potential solutions of (2.5) 
in G(X); this means, due to the one-one correspondence of the latter to 
T>(X), that the optimization task (2.5) is essentially finite dimensional. The 
theorem also justifies the transition to a more convenient optimization do- 
main in the primal formulation appearing in the next section. 

3. Duality, entropy and divergences. The conjugate dual formulation of 
the primal estimation problem (2.5) conveys a maximum entropy interpreta- 
tion and leads us to several concrete proposals for ip. To conform to existing 
mathematical apparatus, we begin by further clarifying the optimization and 
constraint functional classes of our primal formulation. For definitions and 
general background on convex analysis, our primary references are Rock- 
afellar (1970, 1974) and Zeidler (1985); we may also mention Hiriart-Urruty 
and Lemarechal (1993) and Borwein and Lewis (2006). 
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3.1. The primal formulation. Hereafter, 1C(X) will denote the cone of 
closed (lower semicontinuous) convex functions on T~L(X), the convex hull of 
X. This cone is a subset of C(X), the collection of functions continuous on 
7i(X); it is important that C(X) is a linear topological space, with respect 
to the topology of uniform convergence. Note that G(X) C 1C(X) C C(X). In 
view of Theorem 2.1, any solution of (2.5) is also the solution of 

1 n f 

(3.1) — y^g(Xi)+ / ip(g{x)) dx = min ! subject to g G JC(X), 

2 = 1 

and conversely; thus, we will refer to (3.1) as our primal formulation. 

3.2. The dual formulation. The conjugate of ip is 

ip*(y)= sup (yx-ip(x)). 

xSdomr/) 

Since ip is nonincreasing, there are no affine functions with positive slope 
that minorize the graph of ip, hence ip*{y) = +oo for all y > 0. If -0 is differ- 
entiable on the (nonempty) interior of its domain, then ifj* can be obtained 
using differential calculus — as the Legendre transformation of ip; denoting 
the derivative tp' by x-, we have 

(3.2) r{y) = yx-\y)-^(x- 1 {y)), 

where X~ 1 {y) is an y solution, z, of the equation x( z ) = V- The (topological) 
dual of C(X) is C*(X), the space of (signed) Radon measures on T-L(X); its 
distinguished element is P n , the empirical measure supported by the data 
points Xj. The polar cone to 1C(X) is 



K(xy 



\GeC*{X) f gdG<0 for all g€K(X)\. 



Theorem 3.1. Suppose that assumptions (Al) and (A2) hold. The 
strong (Fenchel) dual of the primal formulation (3. 1) is 

ip*(—f(y)) dy = max! subject to f = — , 

(3.3) f V 

GeJC{X)~, 

in the sense that the value, $(g), of the primal objective for any g satisfy- 
ing the constraints of (3.1), dominates the value, for any f satisfying the 
constraints of (3.3), of the objective function in (3.3); the minimal value of 
(3.1) and maximal value of (3.3) coincide. Moreover, there exists f attain- 
ing the maximal value of (3.3). Any dual feasible function f , that is, any 
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/ satisfying the constraints of (3.3) and yielding finite objective function of 
(3.3), is a probability density with respect to the Lebesgue measure: / > 
and J f dx = 1. If condition (A4) is also satisfied, then the dual and primal 
optimal solutions satisfy the relationship f = —tp'(g). 

It should be emphasized that the expression of absolute continuity in (3.3) 
is a requirement on F = P n — G; the dual objective function is defined as 
the conjugate to the primal objective function <£, and is equal to — oo for 
any Radon measure that is not absolutely continuous with respect to the 
Lebesgue measure. This is how regularization operates here: only those F 
qualify for which P n gets canceled with the discrete component of G. Once 
F satisfies this requirement, its density integrates to 1, as shown in the 
proof of Theorem 3.1. The nonnegativity for / yielding finite dual objective 
function is the consequence of ip*(—y) being infinite for y < 0. In practical 
implementations, it may be prudent to enforce / > in the dual explicitly 
as a feasibility constraint. 

3.3. The interpretation of the dual. An immediate consequence of The- 
orem 3.1 is that we can reformulate the maximum likelihood problem posed 
in (2.3) as an equivalent maximum (Shannon) entropy problem. 

Corollary 3.1. Maximum likelihood estimation of a log-concave den- 
sity as posed in (2.3) has an equivalent dual formulation 



I f(y) ^ Ky)dy = "r 1 «*«**>'■■ 



d(P n - G) 



dy 
(3.4) 

G€lC(X)-, 

whose solution satisfies the relationship f = e~ 9 , where g is the solution of 
(2.3). In particular, the solution of (2.3) satisfies f e~ g ( x > dx = 1, therefore 
problems (2.2) and (2.3) are equivalent. 

The emergence of the Shannon entropy is hardly surprising — in view of 
the well-established connections of maximum likelihood estimation to the 
Kullback-Leibler divergence and maximum entropy. Note that the dual cri- 
terion can be also interpreted as choosing the / closest in the Kullback- 
Leibler divergence to the uniform distribution on T-L(X), from all / satisfying 
the dual constraints. 

3.4. Renyi entropies. While the outcome of Corollary 3.1, the equiva- 
lence of (2.2) and (2.3), could be also shown by elementary means, it is 
important to emphasize that the real value of the dual connection lies in the 
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Fig. 1. Primal ip (left) and dual tp* (right) for selected a > from the Renyi family of 
entropies. 



vista of new possibilities it opens. To explore the link to potential alterna- 
tives, we consider the family of entropies originally introduced for a > by 
Renyi (1961, 1965), 



(3.5) 



(1 - a)- 1 \og( J f a (x)dx\ o^l, 



as an extension of the limiting case for a = 1, the Shannon entropy. For 
a/1, maximizing (3.5) over / is equivalent to the maximization of 



(3.6) 



<**(!-«) f nx)dx = - sgn{a -i)[^M d 



o 



o 



The dependence of convexity/concavity properties of y a necessitates a sep- 
arate treatment of the cases with a > 1, when the conjugate pair is 

(-xf/[3, 
0, 



for x < 0, 
for x > 0, 



ijj{x) ■ 
and the cases with a < 1, where 
V>( x ) : 



</>*(y) 



(-y) a /a, for y < 0, 
+oo, for y > 0, 



+oo, 
-x?/P, 



for a; < 0, 
for x > 0, 



V>*(y) 



-(-y) a /a, fory<0, 
+oo, for y > 0, 



where (3 and a are conjugates in the usual sense that 1//3 + 1/a = 1. See 
Figure 1. 

The general form of the primal formulations (3.1) corresponding to (3.5) 
can be written, for a^l, in a unified way as 



(3.7) 



lt^) + ^J\^)fd X = ^nl 



together with the relation between the dual and primal solutions, / = \g\@ 1 . 
Several particular instances merit special attention. 
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3.5. Power divergences. For a > 1, we may write (—g) instead of \g\, and 
then introduce h = —g. The resulting primal formulation is 

1 n If 

(3.8) -- y^h(Xi) + - / h fi (x)dx = min ! subject to h G K,(X). 

n^—l P J hec(x) 

2 = 1 

By Theorem 2.1, this formulation is equivalent to 

1 n If 

(3.9) - - > h(Xi) + - / h,P(x) dx = min! subject tohelC. 

n ~[ P J h 

After substituting f'-'v'—*-) f or /^ multiplying by /3, and rewriting in terms 
of a we obtain a new objective function 



(3-10) - ( -^±Y j f«-\x i ) + J r{x)dx, 



a 



which recalls the "minimum density power divergence estimators," proposed, 
for a > 1, by Basu et al. (1998) in the context of estimation in parametric 
families. 

3.6. Pearson x 2 ■ Although a = 2 is a special case of the power divergence 
family mentioned above, it deserves a special mention. The choice of a = 2 
in the Renyi family leads to the dual formulation 

(3.11) - f f 2 (y)dy = max.\ subject to / = d ^ Pn ~ G > , GeK{X)~. 

The primal formulation can be written, after the application of Theorem 
2.1, in a particularly simple form 

1 ■J 1 \ 1 f 

(3.12) — N g(Xi) H — / (7 (x) da; = min! subject to 5 G /C, 
n^ 2J g 

i=i 

which can be interpreted as a variant of the minimum Pearson x 2 criterion. 
A similar theme can be found in the dual, which can be interpreted as 
returning among all densities satisfying its constraints the one with minimal 
Pearson \ 2 distance to the uniform density on C(X). 

The relation between primal and dual optimal solutions is / = —g; the 
convexity constraint on g therefore implies that / must be concave. Re- 
placing g in (3.12) by — / and appropriately modifying the cone constraint 
gives a variant of the "least-squares estimator," studied by Groeneboom, 
Jongbloed and Wellner (2001) and going back at least to Birge and Massart 
(1993); the estimate was defined to estimate a convex (and decreasing) den- 
sity on 1R + , a domain that is apparently still under the scope of Theorem 
2.1. 
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3.7. Hellinger. While the form of the objective function for a = 2 has 
some computational advantages, its secondary consequence — constraining 
the density itself to be concave rather than its logarithm — is not at all 
appealing. Indeed, all Renyi choices with a > 1 impose a more restrictive 
form of concavity than log-concavity. From our perspective, it seems more 
reasonable to focus attention on weaker forms of concavity, corresponding 
to a < 1. Apart from the celebrated log-concave case a = 1, a promising 
alternative would seem to be Renyi entropy with a = 1/2. This choice in the 
Renyi system leads to the dual 

(3.13) f V7(y) dy = max! subject to / = d ^ Pn ~ G > ; GeK,(X)~, 

J i ny 

and primal, again after the application of Theorem 2.1, 

1 >. n \ I 1 

(3.14) — } g(Xi)+ / , . dx = min! subject to g £ /C. 
nff J 9{x) 9 

The estimated density satisfies / = 1/g 2 , which means that the primal con- 
straint, g G /C, enforces the convexity of g = 1/y/J. In the terminology of 
Section 2, the estimated density is now required to be only — 1/2-concave, 
a significant relaxation of the log-concavity constraint; in addition to all 
log-concave densities, all the Student t v densities with u > 1 satisfy this re- 
quirement. The dual problem (3.13) can be interpreted as a Hellinger fidelity 
criterion, selecting from the cone of dual feasible densities the one closest in 
Hellinger distance to the uniform distribution on H(X). 

3.8. The frontier and beyond? Although the original Renyi system was 
confined to a > 0, a limiting form for a = can be obtained similarly to the 
a = 1 case. It yields the conjugate pair 

+oo, for x < 0, 

— 1/2 — logo;, for x > 0, 



ijj{x) 



-l/2-log(-y), fory<0, 
+oo, for y > 0. 

As is apparent from Figure 1, this ip violates our condition (A5), but may 
nevertheless deserve a brief consideration. Note first that the possible com- 
plications with existence of integrals may occur only in the formulation (2.5) 
with unbounded domain — not in (3.1), where all integrals are of bounded 
functions over a compact domain. The major technical complications with 
ip violating (A5) concern theorems in Section 4, and are briefly discussed 
there. Here we mention only that the resulting dual, adapted directly from 
(3.3), is 

[logf(y)dy = max.l subject to / = ^ P " - ^ , GelC(X)~, 

J f dy 
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and the primal becomes 



1 " f 

— } giXA — logg(x)dx = min ! subject to g S fC(X). 
nf-i J geC(X) 

z=l 

In this case g = 1//, and the estimate is constrained to be — 1-concave, a 
yet still weaker requirement that admits all of the Student t v densities for 
u>0. 

If we interpret the dual problem (3.4), for a = 1, as choosing a constrained 
/ to minimize the Kullback-Leibler divergence of / from the uniform dis- 
tribution on T-L(X), we can similarly interpret the a = dual as minimizing 
the reversed Kullback-Leibler divergence. In parametric estimation, the lat- 
ter objective is sometimes associated with empirical likelihood, while the 
former is associated with exponentially tilted empirical likelihood. See, for 
example, Hall and Presnell (1999) for related discussion in the context of 
kernel density estimation, and Schennach (2007). 

One might try to continue in this fashion marching inexorably toward 
weaker and weaker concavity requirements. There appears to be no obstacle 
in considering a < 0; the general form (3.7) of the primal is still applicable. 
The shape constraints corresponding to negative a encompass a wider and 
wider class of quasi-concave densities, eventually arriving at the — oo-concave 
constraint, at which point we would have sanctioned all of the quasi-concave 
densities. But formal complications, as well as computational difficulties 
dictate the more prudent strategy of restricting attention to a > cases. 

4. Existence and Fisher consistency of estimates. Returning to our gen- 
eral setting, existence, uniqueness and Fisher consistency are established 
under mild conditions on the function ip. 

4.1. Existence of estimates. Theorem 2.1 not only justifies the choice 
of the optimization domain in (3.1), but also shows, due to the one-one 
correspondence between Q{X) and T>(X), that the optimization task (3.1) 
is essentially finite dimensional, parametrized by the values Yi = g{Xi). This 
facilitates the proof of the following existence result. 

Theorem 4.1. Suppose that assumptions (Al), (A2), (A3) and (A5) 
hold, and that %{X) has a nonempty interior. Then the formulation (2.5) 
has a solution g G C(X); if t/j is strictly convex, then this solution is unique. 

4.2. Fisher consistency. In our general setting, a comprehensive asymp- 
totic theory for the proposed estimators remains a formidable objective. 
Considerable recent progress has been made on theory for the univariate log- 
concave (a = 1) maximum likelihood estimator: Pal, Woodroofe and Meyer 
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(2007) proved consistency in the Hellinger metric, Diimbgen and Rufibach 
(2009) prove consistency in the supremum norm on compact intervals, and 
Balabdaoui, Rufibach and Wellner (2009) derive asymptotic distributions. 
For maximum likelihood estimators in M. d , Cule and Samworth (2010) es- 
tablish consistency for estimators of a log-concave density, and Seregin and 
Wellner (2009) for estimators of convex-transformed densities. These results 
are surely suggestive of the plausibility of analogous results for other a and 
dimensions greater than one. However, the highly technical nature of the 
proofs, and their strong reliance on special features of the univariate setting 
indicate that such a development may not be immediate. 

While anything else in this direction may be viewed as speculative, Fisher 
consistency, a crucial prerequisite for a more detailed asymptotic theory, 
can be verified in a quite straightforward manner and essentially complete 
generality. For differentiable ip, Theorem 3.1 gives the relationship between 
the solution g of the optimization task (3.1) and the density estimate: / = 
—'ip'(g). Using the notation \ for tp' , and x _1 f° r its inverse, as in Section 3, 
we may write g = x _1 ( — /)) an d subsequently rewrite the formulation (2.5) 
in terms of the estimated density / (omitting, for brevity, the integration 
variables) 



J x ~\-f)dP n + J \p(x-\-f))dx 



min! 
/ 
(4.1) 

subject to x 1 (~f) £ K-- 

This yields a new objective function — which we nevertheless denote, slightly 
abusing the notation, also $. The population version of this <I> is obtained 
by replacing dP n by f dx: 

(4.2) $„(/)= fx~ l {-f)k + ^{x-\-f))dx. 



The Fisher consistency for an estimator defined by solving (4.1) requires 
that $o(/o) < 3>o(/)> f° r every /; however, there may be a formal problem 
now with the existence of the integral in (4.2), as x~ 1 (f) may take both 
positive or negative values. A possible way of handling this obstacle is the 
strategy of Huber (1967), briefly mentioned in Section 2: instead of $, we 
consider a modified objective function 

(4.3) $(/) = J (x-H-f) + ^j^) dP - + j^(x-\-f)) dx, 

which, when minimized over / satisfying the constraint of (4.1), yields an 
optimization problem equivalent with (4.1), since the difference of $ and $ 
is constant in /. However, the population version of $ 

(4.4) |. (/)= f X'H-f)fo + r(-fo)+^(x'H-f))dx 
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is now better suited for the ensuing version of the Fisher consistency theo- 



rem. 



Theorem 4.2. Suppose that ip satisfies assumptions (Al), (A2), (A4) 
and (A5). The integrand in (4-4) * s then nonnegative for any probability den- 
sity f such that x~ 1 (—f) £ K-> an d identically equal to for f = /o; therefore, 
= $o(/o) < *&o(f)> where $o(/) is well defined for every f , possibly equal 

to +oo. 

In fact, Theorem 4.2 can be proved in the same manner for the unmodified 
$, if dom ijj = (uj, +oo) with u > — oo. Then the inverse of % = i//, and hence 
the range of x _1 is bounded from below by u. In such a case, x(f)fo > w /o> 
so the first term in (4.2) is minorized by an integrable function u)fo; the 
second term is bounded from below by by assumption (A5), so the whole 
integral then exists in the Lebesgue sense, being either finite or equal to 
+oo. 

If, however, assumption (A5) is not satisfied, then the existence of the 
integral should be assumed explicitly; we return to this point briefly at 
the end of the proof of Theorem 4.2. Note that, by comparing (3.2) and 
(4.2), existence of the integral is equivalent to assuming the integrability 
(summability) of 

(4.5) foX'Hfo) + Hx-H-fo)) = -r(-fo), 

that is, the existence and finiteness of the entropy term in the dual (3.3). 

5. Examples of practical use. We employed two independent algorithms 
for solving the convex programming problems posed above: mskscopt from 
the MOSEK software package of Andersen (2006), and the PDCO MATLAB 
procedure of Saunders (2003). Both algorithms are coded in MATLAB and 
employ similar primal-dual, log-barrier methods. Further details regarding 
numerical implementation appear in Appendix B. The crux of both algo- 
rithms is a sequence of Newton-type steps that involve solving large, very 
sparse least squares problems, a task that is very efficiently carried out 
by modern variants of Cholesky decomposition. Several other approaches 
have been explored for computing quasi-concave density estimators that 
are log-concave. An active set algorithm for univariate log-concave density 
estimation was described in Diimbgen, Hiisler and Rufibach (2007) and im- 
plemented in the R package logcondens of Rufibach and Diimbgen (2009). 
Cule, Samworth and Stewart (2010) have recently implemented a promising 
steepest descent algorithm for multivariate log-concave estimation that may 
be adaptable to other quasi-concave density estimation problems. 
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5.1. Univariate example: Velocities of bright stars. To illustrate the ap- 
plication of the foregoing methods, we briefly consider some realistic exam- 
ples. Our first example features data similar to those considered by Pal, 
Woodroofe and Meyer (2007), the type of data where shape constraints 
sometimes arise in a natural manner. The two samples consist of 9092 mea- 
surements of radial and 3933 of rotational velocity for the stars from Bright 
Star Catalog, Hofneit and Warren (1991). The left and right panels of Figure 
2 show the results for the radial and rotational velocity samples, respectively. 

The broken line in the upper panels shows kernel density estimates, each 
time with default MATLAB bandwith selection; the solid lines correspond 
to one of the norm penalized estimates proposed in Koenker and Mizera 
(2008): maximum likelihood penalized by the total variation of the second 
derivative of the logarithm of the estimated density. This is the L 1 version of 
the Silverman's (1982) estimator penalizing the squared L 2 norm of the third 
derivative. The smoothing parameter for the latter estimate was set quite 




50 500 550 500 550 500 550 500 550 



Fig. 2. The estimates of the densities of radial (left) and rotational (right) velocities of 
the stars from the Bright Star Catalog. Broken lines are kernel density estimates in the 
upper two panels, and the solid lines are total variation penalized estimates. In the lower 
two panels the broken lines are the log-concave estimates and the solid lines represent the 
Bellinger (—1/2-concave) estimates. 
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arbitrarily at 1; it seems that this arbitrary choice works quite satisfactorily 
here, providing — somewhat surprisingly, for both samples — about the same 
level of smoothing as the kernel estimator. For the radial velocity sample, the 
two estimates are essentially the same. For the rotational velocity sample, 
however, the right upper panel shows that the kernel density estimate differs 
substantially from the penalized one. Both estimators have the unfortunate 
effect of assigning considerable mass to negative values despite the fact that 
there are no negative observations. This effect is somewhat more pronounced 
for the kernel estimate. 

Since the preliminary analyses of the upper panels indicates that the 
hypothesis of unimodality is plausible for both of the datasets, a natural 
next step is the application of a shape-constrained estimator — a move that, 
among other things, may relieve us of insecurities related to the arbitrary 
choice of smoothing parameters. The broken line in the lower panels of Figure 
2 shows the log-concave maximum likelihood (a = 1), and the solid line the 
Hellinger (— 1/2-concave) estimate (a = 1/2). While, as expected, there is 
almost no difference between the two (and in fact, among all four) estimates 
for the radial velocity dataset, the right lower panel reveals that the log- 
concave estimate yields for the rotational velocity sample a density that is 
monotonically decreasing — which contradicts the evidence suggested by all 
other methods. The Hellinger estimate, on the other hand, exhibits a subtle, 
but visible bump at the location of the plausible mode, thus turning out to 
be visually more informative about the center of the data than the tails. This 
is somewhat paradoxical given its original heavy-tail motivation, confirming 
that the real universe of data analysis can be much more subtle than that 
of the surrounding theoretical constructs. 

5.2. Bivariate example: Criminal fingers. To illustrate our approach in 
a simple bivariate setting, we reconsidered the well-known MacDonell (1902) 
data on the heights and left middle finger lengths of 3000 British criminals. 
This data was employed by Gosset in preliminary simulation work described 
in Student (1908). 

Figure 3 illustrates the Hellinger (—1/2-concave) fit of this data. Contours 
are labeled in units of log-density A notable feature of the data is the unusual 
observation in the middle of the upper edge. This point is highly anomalous, 
at least for any density with exponential tail behavior. The maximum likeli- 
hood estimate of the log-concave model in Figure 4 has very similar central 
contours, but the outer contours fall off much more rapidly implying that 
the log-concave estimate assigns much smaller probability to the region near 
the unusual point. 

5.3. Some simulation evidence. Motivated by a suggestion of one of the 
referees, we undertook some numerical experiments to explore performance 
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9.5 9. 



Fig. 3. Hellmger (—1/2-concave) estimate of the density of Student's criminals. Con- 
tours are labeled in units of log-density. 



of our shape constrained estimators and evaluate whether consistency ap- 
peared to be a plausible conjecture. For the log-concave estimator Pal, 




Fig. 4. Log-concave estimate of the density of Student's criminals. Contours are labeled 
in units of log- density. 
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Fig. 5. Comparison of estimators of several log-concave densities. 



Woodroofe and Meyer (2007) report "Hellinger error" for a fully crossed 
design involving five target densities and five sample sizes with 500 replica- 
tions per cell. 

In Figure 5, we report results of our attempt to reproduce the PWM 
experiment expanded somewhat to consider two competing estimators: the 
adaptive kernel estimator of Silverman (1986) using a Gaussian kernel, and 
the logspline estimator of Kooperberg and Stone (1991) as implemented in 
the logspline R package of Kooperberg. Five target densities are consid- 
ered: (standard) normal, Laplace, Gamma(3), Beta(3, 2) and Weibull(3,l) 
as in PWM. Five sample sizes are studied: 50, 100, 200, 500, 1000. And two 
measures of performance are considered: squared Hellinger distance as in 
PWM in the left panel and L\ distance in the right panel. Plotted points in 
these figures represent cell means. Both figures support the contention that 
the rates of convergence are comparable for all three estimators. 

Figure 6 reports results from a similar experiment for the the —1/2- 
concave estimator described in Section 3.6. We consider five new target 
densities: lognormal, £3, t$, F^q and Pareto(5), all of which fall into the 
— 1/2-concave class. The same competing estimators and sample sizes are 
used. In a small fraction of cases for the second group of densities, less than 
0.2 percent, there were problems either with the convergence of the logspline 
or shape-constrained estimator, or with the numerical integration required 
to evaluate the performance measures, so Figure 6 plots cell medians rather 
than cell means. Again, the figures support the conjecture that the rates of 
convergence for the shape-constrained estimator are competitive with those 
of the adaptive kernel and logspline estimators. 
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Fig. 6. Comparison of estimators of several —1/2-concave densities. 

A concise way to summarize results from these experiments is to estimate 
the simple linear model 

log(yij) = en + p\og(rij) + Uij, 

where yij denotes a cell average of our two error criteria for one of our 
three estimators, for target density i and sample size rij. In this rather 
naive framework, (3 can be interpreted as an empirical rate of convergence 
for the estimator. In Tables 1 and 2, we report these estimates suppressing 
the estimated target density specific ctj's. In this comparison too, the shape 
constrained estimators perform quite well. 

6. Extensions and conclusions. We have described a rather general ap- 
proach to qualitatively constrained density estimation. Log-concave densi- 
ties are an important target class, but other, weaker, concavity requirements 
that permit algebraic tail behavior are also of considerable practical inter- 



Table 1 
Estimated convergence rates for log-concave target densities 



Criterion 



Log-concave 



Kernel 



Logspline 



LI error 
Hellinger 



-0.417 
(0.018) 

-0.875 
(0.032) 



-0.366 
(0.003) 

-0.498 
(0.031) 



-0.393 
(0.012) 

-0.698 
(0.021) 



-0.324 


-0.386 


(0.008) 


(0.01) 


-0.355 


-0.672 


(0.023) 


(0.019) 
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Table 2 
Estimated convergence rates for —1/2-concave target densities 

Criterion — 1/2-concave Kernel Logspline 

LI error -0.405 

(0.004) 
Hellinger -0.751 

(0.034) 



est. Ultimately, the approach accommodates all quasi-concave densities as a 
limit of the Renyi entropy family 

There are many unexplored directions for future research. As we have 
seen, a consequence of the variational formulation of our concavity con- 
straints is that the estimated densities vanish off the convex hull of the 
data. Various treatments for this malady may be suggested. Miiller and Ru- 
fibach (2009) have recently suggested applying one of several estimators of 
the Pareto tail index to the smoothed ordinates from the log-concave pre- 
liminary density estimator. Our inclination would be to prefer solutions that 
impose further regularization on the initial problem. Thus, for example, we 
can add a new penalty term to the primal problem, penalizing the total 
variation of the derivative (gradient) of log/, and choosing a suitable value 
of the associated Lagrange multiplier to smooth the tail behavior at the 
boundary. 

We have adhered, thus far, to the principle that the entropy choice in the 
fidelity criterion of the dual problem should dictate the form of the convex- 
ity constraint: likelihood thus implies log-concavity, Hellinger fidelity implies 
l/V7 concavity, etc. One may wish to break this linkage and consider max- 
imum likelihood estimation of general /9-concave densities. This may have 
some advantages from an inference viewpoint, at the cost of complicating 
the numerical implementation. 

Embedding shape constrained density estimation of the type considered 
here into semiparametric methods would seem to be an attractive option in 
many settings. And, it would obviously be useful to consider inference for 
the validity of shape constraints in the larger context of penalized density 
estimation. We hope to pursue some of these issues in future work. 

APPENDIX A: PROOFS 

Proof of Theorem 2.1. Given h convex, put Y, = h{Xi) and take 
g = grx,Y)i the function defined by (2.6). The convexity of h implies that 
h{x) < g(x) for every x; since ip is nonincr easing, we have 

(A.l) c = ifj(h(x))dx- ip(g(x))dx>0. 
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The definition of gtxy) implies that h{Xi) = g<x,Y){^i) = Yii therefore, the 
rest of $ remains unchanged, and (A.l) implies that &(g) < &{h). 

Suppose that h ^ G{X). Then h 7^ g and the inequality h < g implies that 
doing C dom/j. If domh 7^ domg, then there is a point xq ^ domg from 
the interior of dom/i, because domg is closed. The continuity of h at xq 
implies that h(x) < K < +00 = g{x) for all x from an open neighborhood 
of xq] this proves that c> 0. If domh = domg = %{X), then the polyhedral 
character of Ti(X) implies through Theorem 10.2 of Rockafellar (1970) that 
h is upper semicontinuous relative to T-i(X) at any x G %{X)] that is, if 
h(xo) < g{xo) for some xq G domh, then the inequality holds for all x in an 
open, relatively to Ti(X), neighborhood of xq. Such a relative neighborhood 
has positive Lebesgue measure, due to the fact that the interior of T-L(X) is 
nonempty. Hence, we have c > also in this case and the strict inequality 
$(5) < $(h) follows. □ 

Proof of Theorem 3.1. We use the conventional notation (£,x) to 
denote £(x), the result of the application of a linear functional to x. The 
definition of the conjugate of a convex function H in this notation is 

H*(y)= sup ((y,x)-F(x)); 

xGdomF 

the resulting function is convex itself, being a sup of affine functions. For 
any / G C(X) and any Radon measure G, a linear functional from C(X)*, 
we have 

(G,f)=j fdG. 
We start the proof by rewriting (3.1) as 

n 

~Y,9( X i) + J 1>(j9(x)) dx + i K (X)(g) = Hg) + T( 5 ) = mf !, 

where <£ is the original objective function of (3.1) and T = ijc(x) i s the 
indicator function of IC(X). The expression for the Fenchel dual of this 
type of problem follows from Rockafellar (1966); see also Rockafellar (1974), 
Section 5, Example 11: 

-$*(G)-r*(-G) = max! 

(note that one of the conjugates, in both cites sources, is in the "concave" 
sense, which explains the negative sign of the argument in the second term, 
but not in the first). The conjugate of the indicator of a convex cone 1C(X) 
is the indicator of —JC(X)" [Rockafellar (1974), Section 3, equation (3.14)]. 
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The term — T*(— G) in the objective can be therefore interpreted as a con- 
straint — G £ — fC(X)~ , that is, G S KL{X)~ . The definition of the conjugate 
of <& gives 

®*{G)=su V UG,g)-'^g{X i )- J ^{g{x))dx\ 



(A.2) 



i=l 



suJ (G-P n ,g) - J ip(g(x))dx) = tt*(G- P n ); 



the sup is taken over all g from 

dom<l> = |gG C{X) \ ip(g{x))dx < +00 > = dom^, 

where \I/ is the functional given by 

(A.3) y(g) = J^(g(x))dx, 

and \&* is its conjugate. The form of the latter is given by Rockafellar (1971), 
Corollary 4A: if G is absolutely continuous with respect to the Lebesgue 
measure, then 

(a.4) **(G)=yv*(^w 

otherwise Vl/*(G) = +00. These facts, and expressions (A.2) and (A.4), yield 
(3.3). 

Rockafellar [(1966), Theorem 1; see also Rockafellar (1974), Section 8, 
Example 11'] gives also a constraint qualification for this type of problem: 
to prove strong duality, we need to find some g where both <£ and T are finite 
and one of them is continuous. Such a g is provided by a function constant 
on 'H(X), say g(x) = 1 for all x £ ~H(X). It is convex, thus T(g) = is finite. 
So is &(g); the topology on C(X) is that of uniform convergence, and ip is 
continuous at 1, hence there is a neighborhood of g containing only functions 
for which $ is finite and <I> is continuous at g. 

Once the constraint qualification is verified, we know that the primal 
and dual optimal values coincide (zero duality gap), and that the dual is 
attained — there is an optimal solution to the dual; see Theorem 52.B(3) of 
Zeidler (1985). Due to the fact that ip is decreasing, ip*{—f) = +00 whenever 
/ < 0; thus, if / yields a finite dual objective function, then / is nonnegative. 
If G € K(X)- , then (G, /) < for every / e K(X)\ consequently, 

0>(G,1) = -(G,-1)>0. 
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Therefore, (G, 1) = and for every dual feasible /, 

J f(x) dx = (P n - G, 1) = (P n , 1) - (G, 1) = J 1 dP n = 1. 

That is, every dual feasible / is a probability density with respect to the 
Lebesgue measure. 

If a primal solution, g, exists — the fact that is established by Theorem 
4.1, but here we are exploring only the consequences of such a premise — it 
is related to the dual solution, /, via extremality (Karush-Kuhn-Tucker) 
conditions. The form of this relationship asserted by the theorem follows 
from the second condition of (8.24) in Rockafellar [(1974), Section 8, Exam- 
ple 11'], together with the form of the subgradient of ^f given by Rockafellar 
[(1971), Corollary 4B], combined with the fact established above that the 
estimated density / corresponds to F = P n — G. □ 

Proof of Theorem 4.1. By Theorem 2.1, any potential solution of 
(2.5) lies within the class G(X) of polyhedral functions; due to the one-one 
correspondence between Q{X) and V(X), the set of vectors discretely convex 
relative to X, the existence proof can be carried for (3.1) reparametrized by 
Yi, the putative values of g(Xi). In what follows, X remains fixed, and a, 
(5 will denote generic coefficients of convex combinations: any real numbers 
satisfying a, (3 > 0, a + /? = 1. 

As the correspondence between Q{X) and T>{X) is not a linear map- 
ping (except for d = l), the first thing to be shown is that (3.1) remains 
a convex problem when reparametrized in terms of a vector Y S W 1 , with 
components Y±, . . . ,Y n . The resulting problem minimizes, over Y £ W 1 , the 
objective function 

1 n f 

$v(Y) = -Y,^+ h(9(x,Y)(x))dx ifYeV(X) 

n t=i ^ 

= +oo otherwise. 

Note first that V{X) is a convex subset of W 1 : if Y, Z G V{X), then there ex- 
ist convex functions g, h satisfying Yj = g{Xi) and Z% = h(Xi); subsequently, 
aYi + (3Zi = ag(Xi) + /3h(Xi), and as ag + f3h is also a convex function, we 
obtain that aY + f3Z € *D(X) for any convex combination of Y and Z. Thus, 
it is sufficient to show that 3>x> is convex on D, which amounts to demonstrat- 
ing the convexity of the function Y \— > fip(gtx,Y)( x ))dx- Let Y,Z£ V(X); 
the definition (2.6) gives for their convex combination 

{n n n > 

^2\i(aYi + f3Zi) x = ^2XiXi,^2\i = l,Xi> > 
i=\ i=\ i=\ J 
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>amd^2 X ^ x = J2 XiXi iJ2 Xi = 1 > Xi ^ 



i=l 



j=l i=l 



i=l 



+ /?inf<^A 4 Z J s = ^A i X i ,5^Ai = l,A i >0 

I i=l i=l 

= «5(x,y)(^) + /tyx,z)(aO- 
As V> is nonincreasing and convex, we obtain 

1p(d(X,aY+pZ)(x))dx 



< 



(A.5) 



H a 9(x,Y) {x) + /%y,z) {x)) dx 



< \ a^{9{x,Y)(x))+P'ip{g { x,z){x))dx 



a 



^{g{x,Y){x))dx + f3 I t/j(g(x,z)(x))dx 



as was required. Note that the integral is also finite whenever Y has all 
components in the domain of tp, due to the polyhedral character of g<x,Y)i. x ) 
and the fact that ip is nonincreasing and %{X) is bounded. Otherwise, it 
may be equal only to +00; hence *$fx> is a proper convex function. 

Lemma A.l. Suppose thatY,Z are vectors inW 1 such thatY = (y, . . . ,y) 
has constant components, and Z is arbitrary. For any r > 0, 

(A.6) 9(x,Y+rZ) 0*0 = 9(x,Y) ( x ) + T 9(X,Z) (x)=y + rg {x ,z) (x) ■ 

Proof. Note first that by the definition, grx,Y)( x ) = y identically on 
H(X) for constant Y; likewise, for every x £ H(X), 

{n n n ~\ 

^XiiXi + rZi) x = ^2\iXi,^2Xi = l,Aj >0 > 
*=i i=i i=i J 

( n n n ") 

= inf< y + r^XiZi g = y^ AjXj, y^ Aj = 1, Aj >0 > 



«=i 



j=l i=l 



= y + rinf<^ ^AjZj a = ^ AjXj, ^ Aj = 1, Aj > 

I. «=1 «=1 i=l > 

= 9(x,Y)(x)+Tg {x ,z)(x), 
proving the lemma. D 
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Choose a real number y lying in the domain of ip, and set Y = (y, . . . ,y). 
According to Lemma A.l, gtx.Y){ x ) = V f° r every x € T-L(X); the function 
constant on %{X) is convex, hence Y £ T>(X). Then 

n 

&d(X) = -Y, Y i + / H9(x,Y)(x))dx = y + V(y) Vol(H(X)) < +00; 

therefore, Y lies in the domain of <&j>. For arbitrary Z £ M" - , not identically 
0, and r > 0, we have 

n 

$ V (Y + tZ) = - J2(Yi + TZi)+ 4>(g {xx+TZ) (x)) dx 

71 i=X ^ 

T Tl f 

= y + -/^2 z i+ nP(y + T 9(x,z) 0)) dx 

=v+ t \^22 z *+J - T dx y 

We know that $x> is a convex function on a finite-dimensional linear space 
M. n ; to establish the existence of its minimizer, it is sufficient to show that 
<&1>(Y + tZ) — > +00 for r — > +00, which means that we need to verify that 
the limit of the expression in the parentheses is positive (possibly +00); see 
also Hiriart-Urruty and Lemarechal (1993), Remark 3.2.8. Let E~ , E°, E + 
be sets in H(X) where g(x) = g(x,z)( x ) is i respectively, negative, zero or 
positive; we are to examine the limit behavior of 






*P(y + Tg(x)) ^ 
n ■' 



(A.7) 

^(y\ dx+ [ Mv + rgjx)) ^ 



IE° T JE+ 

when r — > +00. For the integral over E°, the limit is obviously zero. If iJj 
satisfies assumptions (Al) and (A5), then ip is nonincreasing and converging 
to when r — > +00; for every x £ E + then ip(y + Tg(x))/r monotonically 
decreases with increasing r, hence the limit of the integral over E + is zero 
as well. Finally, if ip satisfies also (A3), then for every x S E~ the limit of 
tp(y + T g{ x ))/ T is +00; at the same time, the expression is bounded from 
below by 0, due to (A4). The application of the Fatou lemma then gives that 
the limit of the integral over E~ is +00, whenever E + has positive Lebesgue 
measure. 

The proof of the theorem is then finished by the examination of possible 
alternatives. If the first term in (A.7), the mean of the Zi's, is positive, then 
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the theorem is proved, as all other terms in (A. 7) converge either to or 
+oo. If the first term of (A. 7) is negative or equal to zero, then there must be 
some Zi < (the case with all Zi = is excluded). That means that g(x) = 
9(x.z)( x ) is negative for some x; this implies that E~ has positive Lebesgue 
measure, and then the limit of the second term in (A. 7), the integral over 
E~ , and thus of the whole expression (A. 7) is +oo. This proves the theorem. 

Under the strict convexity of ip, the strict convexity of <3?x> follows (for ap- 
propriate a, (3) from the second inequality in (A. 5), which becomes sharp — 
this is due to the fact that the sharp inequality holds pointwise for all x, 
and thus for the integrals as well. The strict convexity of 3>x> then implies 
the uniqueness of the solution. 

Finally, functions ip satisfying (Al)-(A3), but not necessarily (A5) re- 
quire some special considerations. For the integral over E~ , we have to 
observe that for every r > we have ip(y + rg/x,Z)( x )) > V 7 ! if i'iv) < 0' 
then tp(y)/r > V'(y) f° r every r > 1, if ip{y) > then V ; (2/)/ r > for every 
r > 0. Hence we have also in this case an integrable constant minorant [due 
to the fact that Ti(X) has finite Lebesgue measure]; this justifies the limit 
transition via the Fatou lemma. Finally, for the integral over E + , we need to 
assume that the limit of the integrand is for every x, and find an integrable 
minorant; this may be related to the existence of the integral of the second 
term in (2.5). □ 

Proof of Theorem 4.2. The proof relies on the application of what is 
called Fenchel's inequality by Rockafellar [(1970), page 105] or the (general- 
ized) Young inequality by Hardy, Littlewood and Polya [(1934), Section 4.8], 
or Zeidler [(1985), Section 51.1]. The inequality says says that for arbitrary 
x,y and convex function tjj 

xy<ip(x) +tp*(y). 

Applied pointwise to x = — /o and y = x~ 1 ( — /)> the inequality yields 

(-/o)x _1 (-/) < n-fo) + Hx'H-f)), 

which is equivalent to the nonnegativity of the integrand in (4.4). For / = /o, 
the equality (4.5) implies that 

fox'\-fo) + r(-h) + v>(x -1 (-/o)) = -r(-h) + n-h) = o, 

which proves the theorem. 

For functions not satisfying (A5), integrability of /oX _1 ( — /o) is no longer 
equivalent to that of —ip*(—fo)- However, if we assume the integrability of 
the latter, then the proof can be carried through in the same way. □ 
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APPENDIX B: COMPUTATIONAL DETAILS 

Our computational objective is to provide a unified algorithmic strategy 
for solving the entire class of problems described above. Interior point meth- 
ods designed for general convex programming and capable of exploiting the 
inherently sparse structure of the resulting linear algebra offer a powerful, 
general approach. We have employed two such implementations throughout 
our development process: the PDCO algorithm of Saunders (2003), and the 
MOSEK implementation of Andersen (2006). 

Our generic primal problem (2.5) involves minimizing an objective func- 
tion consisting of a linear component, representing likelihood or some gen- 
eralized notion of fidelity, plus a nonlinear component, representing the in- 
tegrability constraint. Minimization is then subject to a cone constraint 
imposing convexity. We will first describe our procedure for enforcing con- 
vexity, and then turn to the integrability constraint. 

B.l. The convexity constraint. In dimension one convexity of piecewise 
linear functions can be imposed easily by enforcing linear inequality con- 
straints on a set of function values, 7j = #(6) at selected points 6,6,- • ■ , 6n- 
For ordered 6' s > the convex cone constraint can be written as Dj > for a 
tridiagonal matrix D that does second differencing, adapted to the possible 
unequal spacing of the 6' s - 

In dimension two, enforcing convexity becomes more of a challenge. Ide- 
ally, we would utilize knowledge of the polyhedral character of the optimal 
g, established by Theorem 2.1, and implying that the optimal g is piece- 
wise linear over some triangulation of the observations Xi 6R 2 . Once we 
knew the triangulation, it is again straightforward to impose convexity: each 
interior edge of the triangulation generates one linear inequality on the co- 
efficients 7. Unfortunately, the complexity of traveling over a binary tree 
of possible triangulations of the observed points makes finding the optimal 
one difficult. The algorithm implemented in the logConcDEAD package of 
Cule, Gramacy and Samworth (2009), for computing log-concave estimates, 
exploits special features of the log-concave MLE problem and thus does not 
appear to be easily generalizable to our other settings. Finite-element meth- 
ods involving fixed (Delaunay) triangulation of an expanded set of vertices 
were also ultimately deemed unsatisfactory. 

A superior choice, one that circumvents the difficulties of the finite-element, 
fixed triangulation approach, relies on finite differences. Convexity is im- 
posed directly at points on a regular rectangular grid using finite-differences 
to compute the discrete Hessian: 

#n(6,6) = <?(6 + <*,6) - 2<?(6,6) + 3(6 - 5,6), 
#22(6, 6) = 3(6, 6 + <*) -20(6, 6) + 3(6, 6 -&), 
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H w fa,&) = \9(.Si + 8,S2 + S)-gfa + 8,b-6) 

-9{£i-S,t2 + 8)+g(Si-6,S2-S)]/4, 

Convexity is then enforced by imposing positive semidefiniteness. These con- 
straints — convexity at each of the grid points (£u,£2z) — produce a semi- 
definite programming problem. In the bivariate setting the semi-definiteness 
of each H can be reformulated as a rotated quadratic cone constraint; we 
need only constrain the signs of the diagonal elements of H and its de- 
terminant. This simplifies the implementation of the Hellinger estimator in 
MOSEK. For the relatively fine grid used for Figure 3 solution requires about 
25 seconds, considerably quicker than the log-concave estimate of Figure 4 
computed with the implementation of Cule, Gramacy and Samworth (2009). 

B.2. The integrability constraint. For certain special ip, one can evaluate 
the integral term J ip(g(x)) dx in the objective function of (2.5) explicitly — as 
was done for ip(g) = e~ 9 by Cule, Samworth and Stewart (2010). While such 
a strategy may also be possible for certain other specific ip, we adopt a more 
pragmatic approach based on a straightforward Riemannian approximation 



„ m 

(B.l) / ^(s))dx«y>G7(&))si. 

JH(X) ~1 



l-H(X) 

Here, Sj are weights derived from the configuration of £j. Of course, with 
only a modest number of £j's such an approximation may be poor; in di- 
mension one we therefore augment the initial collection of the observed data 
X±,. . . , X n by filling the gaps between their order statistics by further grid 
points, to ensure that the resulting grid (not necessarily uniformly spaced) 
and consisting of the observed data points as well as the new grid points, 
provides a sufficiently accurate approximation (B.l). The Sj's are then sim- 
ply the averages of the adjacent spacings between the ordered £$'s. Given 
the size of problems modern optimization software can successfully handle, 
it is no problem to add an abundance of new points in dimension one. 

In dimension two, the approximation (B.l) is based on the uniformly 
spaced grid of the points used in the finite-difference approach described in 
the previous subsection. As the original data points Xi may no longer lie 
among the grid points £j, we have to modify the fidelity component of the 
objective function: instead of obtaining g(Xi) directly, we obtain it via linear 
interpolation from the values of g at the vertices of the rectangles enclosing 
Xi. As long as the grid is sufficiently fine, the difference is minimal. We use 
this approach often also in dimension one, as it provides better numerical 
stability especially for fine grids and large data sets. 
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B.3. Discrete duality. Adopting the procedures described above, we can 
write the finite-dimensional version of the primal problem as 

(P) {u; T L7 + s T ^(7)|L»7>0}=min!, 

where ^(7) denotes now the m-vector with typical element ip(g(£i)) = tpili)^ 
L is an "evaluation operator" which either selects the data elements from 7, 
or performs the appropriate linear interpolation from the neighboring ones, 
so that L7 denotes the n- vector with typical element, g(Xi), and w is an 
n- vector of observation weights, typically Wi = 1/n. 

Associated with the primal problem (P) is the dual problem 

(D) {-s T ^*{-<j))\S<t> = -w T L + D T i], cp > 0,D T i] > 0} = max!. 

Here, r\ is an to- vector of dual variables and <j> is an m-vector of function 
values representing the density evaluated at the &'s, and S = diag(s). The 
vector ty* is the convex conjugate of Vf defined coordinate-wise with typ- 
ical element ip*(y) = sup x {yx — ip(x)}. Problems (P) and (D) are strongly 
dual in the sense of the following result, which may viewed as the discrete 
counterpart of Theorem 3.1. 

Proposition B.l. If tp is convex and differentiable on the interior X 
of its domain, then the corresponding solutions of (P) and (D) satisfy 

(E) /(6) = ^'(S(6)) fori = l,...,m, 

whenever the elements of g are from I and the elements of f are from the 
image of I under ip' . 

For ^(x) with typical element ip{x) = e~ x we have \F* with elements 
tp*(y) = —ylogy + y, so the dual problem corresponding to maximum likeli- 
hood can be interpreted as maximizing the Shannon entropy of the estimated 
density subject to the constraints appearing in (D). Since g was interpreted 
in (P) as log/, this result justifies our interpretation of solutions of (D) as 
densities provided that they satisfy our integr ability condition. This is eas- 
ily verified and thus justifies the implicit Lagrange multiplier of one on the 
integrability constraint in (P), giving a discrete counterpart of Theorem 3.1. 



Proposition B.2. Let i denote an m-vector of ones, and suppose in 
(P) that w T Lt = 1 and Dl = 0. Then solutions <fi of (D) satisfy s T <fi = 1 and 
cj)>0. 

The crucial element of the proof is that the differencing operator D anni- 
hilates the constant vector and therefore the result extends immediately to 
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other norm-type penalties as well as to the other entropy objectives that we 
have discussed. Indeed, since the second difference operator representing our 
convexity constraint annihilates any affine function it follows by the same 
argument that the mean of the estimated density also coincides with the 
sample mean of the observed X^s. 
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